1. Abstract

In this paper we propose and study a technique to reduce the number of parameters and computation
time in fully-connected layers of neural networks using the Kronecker product, at a mild cost in
prediction quality. The technique proceeds by replacing fully-connected (FC) layers with so-called
Kronecker Fully-Connected (KFC) layers, where the weight matrices of the FC layers are approximated
by linear combinations of multiple Kronecker products of smaller matrices. In particular, given a
model trained on the SVHN dataset, we are able to construct a new KFC model with a 73% reduction
in the total number of parameters, while the error only rises mildly. In contrast, the low-rank
method can only achieve a 35% reduction in the total number of parameters given a similar quality
degradation allowance. If we only compare the KFC layer with its counterpart fully-connected
layer, the reduction in the number of parameters exceeds 99%. The amount of computation is
also reduced, as we replace the matrix product of the large matrices in FC layers with matrix
products of a few smaller matrices in KFC layers. Further experiments on MNIST, SVHN and some
Chinese character recognition models also demonstrate the effectiveness of our technique.

2. Introduction 

Model approximation aims at reducing the number of parameters and the amount of computation of
neural network models, while keeping the quality of prediction results mostly the same (see
footnote 1). Model approximation is important for real-world applications of neural networks, as it
helps satisfy their time and storage constraints.

Footnote 1: In some circumstances, as a smaller number of model parameters reduces the effect of
overfitting, model approximation sometimes leads to more accurate predictions.
In general, given a neural network $f(\cdot;\theta)$, we want to construct another neural network
$\tilde f(\cdot;\tilde\theta)$ within some pre-specified resource constraint, and minimize the
differences between the outputs of the two functions on the possible inputs. An example setup is to
directly minimize the differences between the outputs of the two functions:

$$\inf_{\tilde\theta} \sum_i d\big(f(x_i;\theta),\, \tilde f(x_i;\tilde\theta)\big), \tag{1}$$

where $d$ is some distance function and $x_i$ runs over all input data.

Formulation (1) does not impose any constraints relating the structures of $f$ and $\tilde f$,
meaning that any model can be used to approximate another model. In practice, a structurally
similar model is often used to approximate another model. In this case, model approximation may be
approached in a modular fashion w.r.t. each layer.

2.1 Low Rank Model Approximation 

Low rank approximation in linear regression dates back to Anderson (1951). In Sainath et al. 
(2013), Liao et al. (2013), Xue et al. (2013), Zhang et al. (2014b), Denton et al. (2014), low rank 
approximation of fully-connected layer is used; and Jaderberg et al. (2014), Rigamonti et al. 
(2013), Lebedev et al. (2014) considered low rank approximation of convolution layer. Zhang 
et al. (2014a) considered approximation of multiple layers with nonlinear activations. 

We first outline the low rank approximation method below. The fully-connected layer widely 
used in neural network construction may be formulated as: 

$$L_a = h(L_{a-1} M_a + b_a), \tag{2}$$

where $L_a$ is the output of the $a$-th layer of the neural network, $h$ is the activation
function, $M_a$ is often referred to as the “weight term” and $b_a$ as the “bias term” of the
$a$-th layer.

As the coefficients of the weight term in the fully-connected layers are organized into matrices, 
it is possible to perform low-rank approximation of these matrices to achieve an approximation 
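of the layer, and consequently the whole model. Given the Singular Value Decomposition of a matrix
$M = UDV^*$, where $U, V$ are unitary matrices and $D$ is a diagonal matrix with the diagonal
made up of singular values of $M$, a rank-$k$ approximation of $M \in \mathbb{F}^{m\times n}$ is:

$$M \approx M_k = \tilde U \tilde D \tilde V^*, \quad \text{where } \tilde U \in \mathbb{F}^{m\times k},\ \tilde D \in \mathbb{F}^{k\times k},\ \tilde V \in \mathbb{F}^{n\times k}, \tag{3}$$

where $\tilde U$ and $\tilde V$ are the first $k$ columns of $U$ and $V$ respectively, and
$\tilde D$ is a diagonal matrix made up of the largest $k$ entries of $D$.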

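In this case approximation by SVD is optimal in the sense that the following holds (Horn and
Johnson (1991)):

$$M_r = \arg\inf_{X} \|X - M\|_F \quad \text{s.t. } \operatorname{rank}(X) \le r, \tag{4}$$

and

$$M_r = \arg\inf_{X} \|X - M\|_2 \quad \text{s.t. } \operatorname{rank}(X) \le r. \tag{5}$$

The approximate fully-connected layer induced by SVD is:

$$Z = L_{a-1} \tilde U, \tag{6}$$
$$L_a = h(Z \tilde D \tilde V^* + b_a). \tag{7}$$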
In the modular representation of the neural network, this means that the original fully-connected
layer is now replaced by two consecutive fully-connected layers.
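The replacement of one layer by two can be sketched numerically as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the random matrix stands in for a trained weight matrix, `tanh` stands in for the activation $h$, and the sizes 288, 256 and 96 are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 288, 256, 96            # layer shape and target rank (illustrative values)
M = rng.standard_normal((m, n))   # stand-in for a trained weight matrix
b = rng.standard_normal(n)
X = rng.standard_normal((8, m))   # a mini-batch of 8 inputs, playing the role of L_{a-1}

# Thin SVD: M = U @ diag(s) @ Vh, singular values sorted in descending order.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
U_k, s_k, Vh_k = U[:, :k], s[:k], Vh[:k, :]
M_k = (U_k * s_k) @ Vh_k          # rank-k approximation of M

# The Frobenius error equals the norm of the discarded singular values.
fro_err = np.linalg.norm(M - M_k, "fro")
tail = np.sqrt(np.sum(s[k:] ** 2))

# Two consecutive layers: Z = X U_k, then h(Z D_k V_k^* + b).
h = np.tanh                       # assumed activation, for illustration only
Z = X @ U_k
out_two_layers = h((Z * s_k) @ Vh_k + b)
out_one_layer = h(X @ M_k + b)

params_full = m * n
params_low_rank = m * k + k * n   # D_k can be folded into either factor
```

With these sizes the two-layer form stores 52,224 weights instead of 73,728, and both forms produce the same output up to floating point error.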

However, the above post-processing approach only ensures an optimal approximation of $M$ under
the rank constraint, while there is still no guarantee that such an approximation is optimal
w.r.t. the input data. I.e., the optimum of the following may well be different from the
rank-$r$ approximation $M_r$ w.r.t. some given input $X$:

$$\inf_{Z} \|ZX - MX\|_F \quad \text{s.t. } \operatorname{rank}(Z) \le r. \tag{8}$$

Hence it is often necessary for the resulting low-rank model $\mathcal{M}$ to be trained for a few
more epochs on the input, which is also known as the “fine-tuning” process.

Alternatively, we note that the rank constraint can be enforced by the following structural
requirement for $X \in \mathbb{F}^{m\times n}$:

$$\operatorname{rank}(X) \le r \iff \exists\, A \in \mathbb{F}^{m\times r},\ B \in \mathbb{F}^{r\times n} \ \text{s.t.}\ X = AB. \tag{9}$$

In light of this, if we want to impose a rank constraint on a fully-connected layer $L(M, g)$
in a neural network where $M \in \mathbb{F}^{m\times n}$, we can replace that layer with two
consecutive layers $L_1(B, g_1)$ and $L_2(A, g_2)$, where $g_1(x) = x$, $g_2 = g$, and $M = AB$
where $A \in \mathbb{F}^{m\times r}$, $B \in \mathbb{F}^{r\times n}$, and then train the
structurally constrained neural network on the training data.

As a third method, a regularization term inducing low rank matrices may be imposed on the 
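weight matrices. In this case, with $y_i$ the target for input $x_i$ and $M_j$ the weight term of
the $j$-th layer, the training of a $k$-layer model is modified to be:

$$\inf_{\theta} \sum_i d\big(f(x_i;\theta),\, y_i\big) + \sum_{j=1}^{k} r(M_j), \tag{10}$$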

where r is the regularization term. For the weight term of the FC layers, conceptually we 
may use the matrix rank function as the regularization term. However, as the rank function is 
only well-defined for infinite-precision numbers, nuclear norm may be used as its convex proxy 
Jaderberg et al. (2014), Recht et al. (2010). 
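To illustrate why the nuclear norm can act as a proxy for rank, the sketch below compares a rank-3 matrix with a generic full-rank matrix of the same Frobenius norm. The helper name `nuclear_norm` and the matrix sizes are assumptions made for this example only.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M, a convex surrogate for rank(M)."""
    return np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
B = rng.standard_normal((3, 40))
low_rank = A @ B                          # rank 3 by construction
full_rank = rng.standard_normal((50, 40))

# Rescale so both matrices have the same Frobenius norm, for a fair comparison.
low_rank *= np.linalg.norm(full_rank) / np.linalg.norm(low_rank)

penalty_low = nuclear_norm(low_rank)      # small: only 3 nonzero singular values
penalty_full = nuclear_norm(full_rank)    # larger: 40 nonzero singular values
```

At equal Frobenius norm the low-rank matrix receives the smaller penalty, which is the behaviour a rank-inducing regularizer needs.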

3. Model Approximation by Kronecker Product 

Next we propose to use Kronecker products of matrices of particular shapes for model
approximation in Section 3.1. We also outline the relationship between the Kronecker product
approximation and low-rank approximation in Section 3.2.

Below we measure the reduction in the amount of computation by the number of floating point
operations. In particular, we will assume the computational complexity of multiplying two matrices
of dimensions $M \times K$ and $K \times N$ to be $O(MKN)$, as many neural network implementations
(Bastien et al. (2012), Bergstra et al. (2010), Jia et al. (2014), Collobert et al. (2011)) have
not used algorithms of lower computational complexity for the typical inputs of neural networks.
Our analysis is mostly immune to the “hidden constant” problem in computational complexity
analysis, as the underlying computations of the transformed model may also be carried out by
matrix products.

3.1 Weight Matrix Approximation by Kronecker Product 

We next discuss how to use Kronecker product to approximate weight matrices of FC layers, 
leading to construction of a new kind of layer which we call Kronecker Fully-Connected layer. 
The idea originates from the observation that for a matrix $M \in \mathbb{F}^{m\times n}$ where
the dimensions are not prime (see footnote 2), we have approximations like:

$$M \approx M_1 \otimes M_2, \tag{11}$$

where $m = m_1 m_2$, $n = n_1 n_2$, $M_1 \in \mathbb{F}^{m_1\times n_1}$, $M_2 \in \mathbb{F}^{m_2\times n_2}$.

Any factors of $m$ and $n$ may be selected as $m_1$ and $n_1$ in the above formulation. However, in
a Convolutional Neural Network, the input to an FC layer may be a tensor of order 4, which has
some natural shape constraints that we will try to leverage in 3.1.1. Otherwise, when the input
is a matrix, we do not have natural choices of $m_1$ and $n_1$. We will explore heuristics to pick
$m_1$ and $n_1$ in 3.1.7.
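For readers less familiar with the Kronecker product, the following NumPy sketch shows its shape, its block structure and the resulting saving in stored parameters. The factor sizes are hypothetical and chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, n1 = 4, 8     # shape of M1 (hypothetical factor sizes)
m2, n2 = 6, 5     # shape of M2
M1 = rng.standard_normal((m1, n1))
M2 = rng.standard_normal((m2, n2))

M = np.kron(M1, M2)                 # shape (m1*m2, n1*n2)

# Block structure: block (i, j) of M equals M1[i, j] * M2.
i, j = 2, 3
block = M[i * m2:(i + 1) * m2, j * n2:(j + 1) * n2]

params_dense = M.size               # m * n
params_kron = M1.size + M2.size     # m1*n1 + m2*n2
```

Here a 24 × 40 matrix with 960 entries is described by only 62 numbers.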

3.1.1 Kronecker product approximation for fully-connected layer with 4D tensor input

In a convolutional layer processing images, the input data $L_{a-1}$ may be a tensor of order 4
as $T_{nchw}$, where $n = 1, 2, \cdots, N$ runs over $N$ different instances of data,
$c = 1, 2, \cdots, C$ runs over $C$ channels of the given images, $h = 1, 2, \cdots, H$ runs over
$H$ rows of the images, and $w = 1, 2, \cdots, W$ runs over $W$ columns of the images. $T$ is often
reshaped into a matrix before being fed into a fully-connected layer as $D_{nj}$, where
$n = 1, 2, \cdots, N$ runs over the $N$ different instances of data and $j = 1, 2, \cdots, CHW$
runs over the combined dimension of channel, height, and width of images. The weights of the
fully-connected layer would then be a matrix $M_{jk}$, where $j = 1, 2, \cdots, CHW$ and
$k = 1, 2, \cdots, K$ runs over the number of output channels. I.e., the layer may be written as:

$$D = \operatorname{Reshape}(L_{a-1}), \tag{12}$$
$$L_a = h(DM + b). \tag{13}$$

Though the reshaping transformation from T to D does not incur any loss in pixel values 
of data, we note that the dimension information of the tensor of order 4 is lost in the matrix 
representation. As a consequence, M has CHWK number of parameters. 

Due to the shape of M, we may propose a few kinds of structural constraint on M by requiring 
M to be Kronecker product of matrices of particular shapes. 

3.1.2 Formulation I 

In this formulation, we require $M = M_1 \otimes M_2 \otimes M_3$, where
$M_1 \in \mathbb{F}^{C\times K_1}$, $M_2 \in \mathbb{F}^{H\times K_2}$,
$M_3 \in \mathbb{F}^{W\times K_3}$, and $K = K_1 K_2 K_3$. The number of parameters is reduced to
$CK_1 + HK_2 + WK_3$. The underlying assumption for this model is that the transformation is
invariant across rows and columns of the images.
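One way to see what Formulation I computes is to contract each axis of the input tensor with its own small factor and compare against the explicit Kronecker product. The sketch below does this in NumPy with toy sizes; the use of `einsum` is our choice for the illustration and is not taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 2, 3, 4, 5        # toy tensor shape (illustrative values)
K1, K2, K3 = 2, 3, 2           # K = K1 * K2 * K3 = 12

T = rng.standard_normal((N, C, H, W))
M1 = rng.standard_normal((C, K1))
M2 = rng.standard_normal((H, K2))
M3 = rng.standard_normal((W, K3))

# Reference: reshape T to D (N x CHW) and multiply by the explicit Kronecker product.
D = T.reshape(N, C * H * W)
M = np.kron(np.kron(M1, M2), M3)          # shape (CHW, K)
out_ref = D @ M

# Structured: contract the channel, row and column axes with M1, M2 and M3 respectively.
out_kfc = np.einsum("nchw,ca,hb,wd->nabd", T, M1, M2, M3).reshape(N, K1 * K2 * K3)

params_dense = C * H * W * K1 * K2 * K3   # CHWK
params_kfc = C * K1 + H * K2 + W * K3     # CK1 + HK2 + WK3
```

Both paths give the same output, while the structured form never materializes the $CHW \times K$ matrix.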

3.1.3 Formulation II 

In this formulation, we require $M = M_1 \otimes M_2$, where $M_1 \in \mathbb{F}^{C\times K_1}$,
$M_2 \in \mathbb{F}^{HW\times K_2}$, and $K = K_1 K_2$. The number of parameters is reduced to
$CK_1 + HWK_2$. The underlying assumption for this model is that the channel transformation should
be decoupled from the spatial transformation.

Footnote 2: In case either $m$ or $n$ is prime, it is possible to add some extra dummy feature or
output class to make the dimensions divisible.
3.1.4 Formulation III

In this formulation, we require $M = M_1 \otimes M_2$, where $M_1 \in \mathbb{F}^{CH\times K_1}$,
$M_2 \in \mathbb{F}^{W\times K_2}$, and $K = K_1 K_2$. The number of parameters is reduced to
$CHK_1 + WK_2$. The underlying assumption for this model is that the transformation w.r.t. columns
may be decoupled.

3.1.5 Formulation IV 

In this formulation, we require $M = M_1 \otimes M_2$, where $M_1 \in \mathbb{F}^{CW\times K_1}$,
$M_2 \in \mathbb{F}^{H\times K_2}$, and $K = K_1 K_2$. The number of parameters is reduced to
$CWK_1 + HK_2$. The underlying assumption for this model is that the transformation w.r.t. rows
may be decoupled.
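The parameter counts of the four formulations can be compared with a few helper functions. The function names below are hypothetical, and the counts cover weights only (biases are excluded, which is an assumption of this sketch). The example values are the layer shapes reported in Sections 4.1 and 4.2.

```python
def dense_params(C, H, W, K):
    """Weights of the ordinary fully-connected layer: CHWK."""
    return C * H * W * K

def formulation_i(C, H, W, K1, K2, K3):
    return C * K1 + H * K2 + W * K3

def formulation_ii(C, H, W, K1, K2):
    return C * K1 + H * W * K2

def formulation_iii(C, H, W, K1, K2):
    return C * H * K1 + W * K2

def formulation_iv(C, H, W, K1, K2):
    return C * W * K1 + H * K2

# MNIST layer (Section 4.1): input 32 x 3 x 3, K = 256 = 64 * 4.
mnist_dense = dense_params(32, 3, 3, 256)
mnist_ii = formulation_ii(32, 3, 3, 64, 4)

# SVHN layer (Section 4.2): input 256 x 5 x 5, K = 256 = 64 * 4.
svhn_dense = dense_params(256, 5, 5, 256)
svhn_ii = formulation_ii(256, 5, 5, 64, 4)

mnist_reduction = 1 - mnist_ii / mnist_dense
svhn_reduction = 1 - svhn_ii / svhn_dense
```

For Formulation II this gives 2,084 weights on the MNIST layer and 16,484 on the SVHN layer, in line with the 2.1K and 0.016M entries of Tables 1 and 2.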

3.1.6 Combined Formulations 

Note that the above four formulations may be linearly combined to produce more possible kinds
of formulations. Which formulation, or combination of formulations, to select is a design choice
that trades off the number of parameters against the amount of computation.

3.1.7 Kronecker product approximation for matrix input 

For fully-connected layers whose inputs are matrices, there are no natural dimensions to adopt
for the shapes of the smaller weight matrices in KFC. Through experiments, we find it possible
to arbitrarily pick a decomposition of the input matrix dimensions to enforce the Kronecker product
structural constraint. We will refer to this formulation as KFCM.

Concretely, when the input to a fully-connected layer is $X \in \mathbb{F}^{N\times C}$ and the
weight matrix of the layer is $W \in \mathbb{F}^{C\times K}$, we can construct an approximation of
$W$ as:

$$\tilde W = W_1 \otimes W_2 \approx W, \tag{14}$$

where $C = C_1 C_2$, $K = K_1 K_2$, $W_1 \in \mathbb{F}^{C_1\times K_1}$ and $W_2 \in \mathbb{F}^{C_2\times K_2}$.

The computational complexity will be reduced from $O(NCK)$ to
$O\big(NCK(\tfrac{1}{K_2} + \tfrac{1}{C_1})\big) = O(NC_2C_1K_1 + NC_2K_1K_2)$, while the number of
parameters will be reduced from $CK$ to $C_1K_1 + C_2K_2$.

Through experiments, we have found it sensible to pick $C_1 \approx \sqrt{C}$ and $K_1 \approx \sqrt{K}$.
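The reduced complexity comes from never forming the Kronecker product explicitly. A minimal NumPy sketch of such a forward pass is given below; the function name `kfcm_forward` and the toy sizes are assumptions for illustration, and the original Theano implementation may be organized differently.

```python
import numpy as np

def kfcm_forward(X, W1, W2):
    """Compute X @ np.kron(W1, W2) without forming the Kronecker product."""
    N = X.shape[0]
    C1, K1 = W1.shape
    C2, K2 = W2.shape
    X3 = X.reshape(N, C1, C2)                 # split the C axis into (C1, C2)
    T = np.einsum("nab,ak->nkb", X3, W1)      # contract C1: N*C1*C2*K1 multiply-adds
    Y3 = np.einsum("nkb,bl->nkl", T, W2)      # contract C2: N*C2*K1*K2 multiply-adds
    return Y3.reshape(N, K1 * K2)

rng = np.random.default_rng(0)
N, C1, C2, K1, K2 = 5, 4, 6, 3, 7
X = rng.standard_normal((N, C1 * C2))
W1 = rng.standard_normal((C1, K1))
W2 = rng.standard_normal((C2, K2))

Y_fast = kfcm_forward(X, W1, W2)
Y_ref = X @ np.kron(W1, W2)                   # dense reference, N*C*K multiply-adds
```

The two contractions correspond to the two terms $NC_2C_1K_1$ and $NC_2K_1K_2$ in the complexity estimate above.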

As the choice of $C_1$ and $K_1$ above is arbitrary, we may use a linear combination of Kronecker
products of matrices of different shapes for approximation:

$$\tilde W = \sum_{j=1}^{J} W_{1j} \otimes W_{2j} \approx W, \tag{15}$$

where $W_{1j} \in \mathbb{F}^{C_{1j}\times K_{1j}}$ and $W_{2j} \in \mathbb{F}^{C_{2j}\times K_{2j}}$.
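As a small illustration of combining several shapes, the sketch below sums three Kronecker products whose factors have different shapes but whose products all have shape $C \times K$. The sizes and the list `shapes` are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 24, 36
# Each entry is (C1, K1); the second factor then has shape (C // C1, K // K1).
shapes = [(4, 6), (6, 4), (2, 9)]

factors = []
for C1, K1 in shapes:
    W1 = rng.standard_normal((C1, K1))
    W2 = rng.standard_normal((C // C1, K // K1))
    factors.append((W1, W2))

# Every term has shape (C, K), so the terms can be added.
W_tilde = sum(np.kron(W1, W2) for W1, W2 in factors)
params = sum(W1.size + W2.size for W1, W2 in factors)
```

The combined layer here stores 186 numbers in place of the 864 entries of a dense $24 \times 36$ matrix.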

3.2 Relationship between Kronecker Product Constraint and Low Rank Constraint 

It turns out that factorization by Kronecker product is closely related to the low-rank
approximation method. In fact, approximating a matrix $M$ with the Kronecker product
$M_1 \otimes M_2$ of two matrices may be cast as a Nearest Kronecker Product problem:

$$\inf_{M_1, M_2} \|M - M_1 \otimes M_2\|_F. \tag{16}$$
An equivalence relation for the above problem is given in Van Loan and Pitsianis (1993),
Van Loan (2000) as:

$$\arg\inf_{M_1, M_2} \|M - M_1 \otimes M_2\|_F = \arg\inf_{M_1, M_2} \|\mathcal{R}(M) - \operatorname{vec}M_1 (\operatorname{vec}M_2)^\top\|_F, \tag{17}$$

where $\mathcal{R}(M)$ is a matrix formed by a fixed reordering of the entries of $M$.

Note the right-hand side of formula (17) is a rank-1 approximation of the matrix $\mathcal{R}(M)$,
and hence has a closed form solution. However, the above approximation is only optimal w.r.t. the
parameters of the weight matrices, but not w.r.t. the prediction quality over input data.
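The closed form solution can be sketched in NumPy as follows. The helper names `rearrange` and `nearest_kronecker` are ours, and the sketch uses a row-major vectorization, which is one valid choice of the fixed reordering but not necessarily the convention of Van Loan and Pitsianis (1993).

```python
import numpy as np

def rearrange(M, m1, n1, m2, n2):
    """R(M): reorder entries so that R(A kron B) = vec(A) vec(B)^T (row-major vec)."""
    return M.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

def nearest_kronecker(M, m1, n1, m2, n2):
    """Best Frobenius-norm approximation of M by A kron B, via a rank-1 SVD of R(M)."""
    U, s, Vh = np.linalg.svd(rearrange(M, m1, n1, m2, n2), full_matrices=False)
    A = (np.sqrt(s[0]) * U[:, 0]).reshape(m1, n1)
    B = (np.sqrt(s[0]) * Vh[0, :]).reshape(m2, n2)
    return A, B, s

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 3, 4, 5, 2
A_true = rng.standard_normal((m1, n1))
B_true = rng.standard_normal((m2, n2))

# An exact Kronecker product is recovered (up to a scalar exchanged between the factors).
A, B, _ = nearest_kronecker(np.kron(A_true, B_true), m1, n1, m2, n2)
exact_err = np.linalg.norm(np.kron(A_true, B_true) - np.kron(A, B))

# For a generic matrix the residual equals the norm of the discarded singular values.
M = rng.standard_normal((m1 * m2, n1 * n2))
A, B, s = nearest_kronecker(M, m1, n1, m2, n2)
resid = np.linalg.norm(M - np.kron(A, B))
```

Because the reordering only permutes entries, it preserves the Frobenius norm, which is why the residual can be read off from the singular values of $\mathcal{R}(M)$.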

Similarly, though there are iterative algorithms for rank-1 approximation of tensor Friedland 
et al. (2013), Lathauwer et al. (2000), the optimality of the approximation is lost once input data 
distribution is taken into consideration. 

Hence in practice, we only use the Kronecker Product constraint to construct KFC layers 
and optimize the values of the weights through the training process on the input data. 

3.3 Extension to Sum of Kronecker Product 

Just as low-rank approximation may be extended beyond rank-1 to arbitrary number of ranks, one 
could extend the Kronecker Product approximation to Sum of Kronecker Product approximation. 
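Concretely, one notes the following decomposition of $M$:

$$M = \sum_{i=1}^{\operatorname{rank}(\mathcal{R}(M))} A_i \otimes B_i. \tag{18}$$

Hence it is possible to find $k$-approximations:

$$M \approx \sum_{i=1}^{k} A_i \otimes B_i. \tag{19}$$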

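We can then generalize Formulations I-IV in 3.1 to the case of a sum of Kronecker products.
We may further combine the multiple-shape formulation of (15) to get the general form of the KFC
layer:

$$M \approx \sum_{j=1}^{J} \sum_{i=1}^{k} A_{ij} \otimes B_{ij}, \tag{20}$$

where $A_{ij} \in \mathbb{F}^{m_{1j}\times n_{1j}}$ and $B_{ij} \in \mathbb{F}^{m_{2j}\times n_{2j}}$.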
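For a single shape, a sum of $k$ Kronecker products can be obtained from the top $k$ singular triples of the rearranged matrix, in the same way a rank-$k$ approximation is obtained from a truncated SVD. The NumPy sketch below illustrates this on a random matrix; the helper names and the row-major reordering are assumptions of the example, and in the experiments the factors are trained rather than computed this way.

```python
import numpy as np

def rearrange(M, m1, n1, m2, n2):
    """Reorder entries so that a Kronecker product A kron B maps to vec(A) vec(B)^T."""
    return M.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

def kron_sum_approx(M, m1, n1, m2, n2, k):
    """Approximate M by a sum of k Kronecker products using the top-k SVD of R(M)."""
    U, s, Vh = np.linalg.svd(rearrange(M, m1, n1, m2, n2), full_matrices=False)
    terms = [(s[i] * U[:, i].reshape(m1, n1), Vh[i, :].reshape(m2, n2)) for i in range(k)]
    return sum(np.kron(A, B) for A, B in terms)

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 4, 3, 2, 5
M = rng.standard_normal((m1 * m2, n1 * n2))
full_rank = min(m1 * n1, m2 * n2)   # rank of R(M) for a generic M

errors = [np.linalg.norm(M - kron_sum_approx(M, m1, n1, m2, n2, k))
          for k in range(1, full_rank + 1)]
```

The approximation error does not increase with $k$ and vanishes once $k$ reaches the rank of the rearranged matrix, matching the exact decomposition in (18).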

4. Empirical Evaluation of Kronecker product method 

We next empirically study the properties and efficacy of the Kronecker product method and 
compare it with some other common low rank model approximation methods. 

To make a fair comparison, for each dataset, we train a convolutional neural network with
a fully-connected layer as a baseline. Then we replace the fully-connected layer with different
layers according to different methods and train the new network until the quality metrics
stabilize. We then compare the KFC method with the low-rank method and the baseline model in terms
of the number of parameters and prediction quality. We do the experiments based on an
implementation of KFC layers in the Theano framework (Bergstra et al. (2010), Bastien et al.
(2012)).

As the running time may depend on particular implementation details of the KFC layers and the
Theano framework, we do not report running time below. However, there is no noticeable slowdown in
our experiments and the complexity analysis suggests that there should be a significant reduction
in the amount of computation.

4.1 MNIST 

The MNIST dataset (LeCun et al. (1998)) consists of 28 × 28 grey scale images of handwritten
digits. There are 60000 training images and 10000 test images. We select the last 10000 training
images as the validation set.

Our baseline model has 8 layers and the first 6 layers consist of four convolutional layers and 
two pooling layers. The 7th layer is the fully-connected layer and the 8th is the softmax output. 

The input of the fully-connected layer is of size 32 × 3 × 3, where 32 is the number of channels
and 3 is the side length of image patches (the mini-batch size is omitted). The output of the
fully-connected layer is of size 256, so the weight matrix is of size 288 × 256.

CNN training is done with Adam (Kingma and Ba (2014)) with weight decay of 0.0001. Dropout
(Hinton et al. (2012)) of 0.5 is used on the fully-connected layer and KFC layer, and
$y = |\tanh x|$ is used as the activation function. The initial learning rate is 1e-4 for Adam.

Test results are listed in Table 1. The number of layer parameters means the number of
parameters of the fully-connected layer or its counterpart layer(s). The number of model
parameters is the number of parameters of the whole model. The test error is the test error of the
model with the minimum validation error.

In the Cut-96 method, we use 96 output neurons instead of 256 in the fully-connected layer. In the
LowRank-96 method, we replace the fully-connected layer with two fully-connected layers where
the first FC layer output size is 96 and the second FC layer output size is 256. In the KFC-II
method, we replace the fully-connected layer with a KFC layer using formulation II with $K_1 = 64$
and $K_2 = 4$. In the KFC-Combined method, we replace the fully-connected layer with a KFC
layer that linearly combines formulations II, III and IV ($K_1 = 64$, $K_2 = 4$ in formulation II,
$K_1 = 128$, $K_2 = 2$ in formulations III and IV).


Table 1: Comparison of using Low-Rank method and using KFC layers on MNIST dataset

Methods        # of Layer Params (%Reduction)   # of Model Params (%Reduction)   Test Error
Baseline       74.0K                            99.5K                            0.51%
Cut-96         27.8K (62.5%)                    51.7K (48.1%)                    0.58%
LowRank-96     52.6K (39.0%)                    78.1K (21.6%)                    0.54%
KFC-II         2.1K (97.2%)                     27.7K (72.2%)                    0.76%
KFC-Combined   27.0K (63.51%)                   52.5K (47.2%)                    0.57%


4.2 Street View House Numbers 

The SVHN dataset (Netzer et al. (2011)) is a real-world digit recognition dataset consisting of
photos of house numbers in Google Street View images. The dataset comes in two formats
and we consider the second format: 32-by-32 colored images centered around a single character.
There are 73257 digits for training, 26032 digits for testing, and 531131 less difficult samples
which can be used as extra training data. To build a validation set, we randomly select 400
images per class from the training set and 200 images per class from the extra training set, as
Sermanet et al. (2012), Goodfellow et al. (2013) did.

Here we use a neural network similar to, but larger than, the one used for MNIST as the baseline.
The input of the fully-connected layer is of size 256 × 5 × 5. The fully-connected layer has 256
output neurons. Other implementation details are not changed. Test results are listed in Table 2.
In the Cut-N method, we use N output neurons instead of 256 in the fully-connected layer. In
the LowRank-N method, we replace the fully-connected layer with two fully-connected layers
where the first FC layer output size is N and the second FC layer output size is 256. In the
KFC-II method, we replace the fully-connected layer with a KFC layer using formulation II with
$K_1 = 64$ and $K_2 = 4$. In the KFC-Combined method, we replace the fully-connected layer with a
KFC layer that linearly combines formulations II, III and IV ($K_1 = 64$, $K_2 = 4$ in formulation
II, $K_1 = 128$, $K_2 = 2$ in formulations III and IV). In the KFC-RankN method, we use KFC
formulation II with $K_1 = 64$, $K_2 = 2$ and extend it to rank N as described in Section 3.3.


Table 2: Comparison of using Low-Rank method and using KFC layers on SVHN dataset

Methods        # of Layer Params (%Reduction)   # of Model Params (%Reduction)   Test Error
Baseline       1.64M                            2.20M                            2.57%
Cut-128        0.82M (50.0%)                    1.38M (37.3%)                    2.79%
Cut-64         0.41M (25.0%)                    0.97M (55.9%)                    3.19%
LowRank-128    0.85M (48.2%)                    1.42M (35.7%)                    3.02%
LowRank-64     0.43M (73.7%)                    0.99M (55.1%)                    3.67%
KFC-II         0.016M (99.0%)                   0.58M (73.7%)                    3.33%
KFC-Combined   0.34M (79.3%)                    0.91M (58.6%)                    2.60%
KFC-Rank10     0.17M (89.6%)                    0.73M (66.8%)                    3.19%


4.3 Chinese Character Recognition 

We also evaluate the application of KFC to a Chinese character recognition model. Our experiments
are done on a private dataset for the moment and may be extended to other established Chinese
character recognition datasets like HCL2000 (Zhang et al. (2009)) and CASIA-HWDB (Liu et al.
(2013)).

For this task we also use a convolutional neural network. The distinguishing feature of the
neural network is that following the convolution and pooling layers, it has two FC layers, one
with hidden size 1536, and the other with hidden size of more than 6000.

The two FC layers happen to be of different types. The 1st FC layer accepts a tensor as input and
the 2nd FC layer accepts a matrix as input. We apply the KFC-I formulation to the 1st FC layer and
KFCM to the 2nd FC layer.

As Table 3 shows, KFC can significantly reduce the number of parameters. In the case of
“KFC and KFCM (rank=1)”, however, this also leads to serious degradation of prediction quality.
By increasing the rank from 1 to 10, we are able to recover most of the lost prediction
quality, while the rank-10 model is still very small compared to the baseline model.






Table 3: Effect of using KFC layers on a Chinese recognition dataset

Methods             %Reduction of 1st     %Reduction of 2nd     %Reduction of    Test Error
                    FC Layer Params       FC Layer Params       Total Params
Baseline            0%                    0%                    0%               10.6%
KFC-II              99.3%                 0%                    36.0%            11.6%
KFC-KFCM-rank1      98.7%                 99.9%                 94.5%            21.8%
KFC-KFCM-rank10     93.3%                 99.1%                 91.8%            13.0%


5. Conclusion and Future Work 

In this paper, we propose and study methods for approximating the weight matrices of
fully-connected layers with sums of Kronecker products of smaller matrices, resulting in a new
type of layer which we call the Kronecker Fully-Connected layer. We consider both the case when the
input to the fully-connected layer is a tensor of order 4 and the case when the input is a matrix.
We have found that using the KFC layer can significantly reduce the number of parameters and the
amount of computation in experiments on MNIST, SVHN and Chinese character recognition.

As future work, we note that when the weight parameters of a convolutional layer form a tensor of
order 4 as $T \in \mathbb{F}^{K\times C\times H\times W}$, it can be represented as a collection of
$H \times W$ matrices $T_{\cdot\cdot hw}$. We can then approximate each matrix by Kronecker
products as $T_{\cdot\cdot hw} \approx A_{hw} \otimes B_{hw}$ following the KFCM formulation, and
apply the other techniques outlined in this paper. It is also noted that the Kronecker product
technique may also be applied to other neural network architectures like Recurrent Neural
Networks, for example approximating transition matrices with linear combinations of Kronecker
products.